Unary TEI Elements and the Token Based Corpus
Abstract
The establishment of TEI as a standard for textual data generated outside the narrow domain of corpus linguistics, in history, literature, philosophy and more, has led to a fruitful integration of encoding vocabulary from different fields of interest, but at the necessary cost of a large stock of elements, heterogeneous interpretations of those elements, and limitations on the kinds of annotation combinations that a schema allows. Meanwhile, in corpus and computational linguistics circles, generic, vocabulary-agnostic, graph-based models of corpus representation have gained prominence; notable examples are PAULA (Dipper 2005) and GrAF (Ide & Suderman 2007), the latter recently canonized as part of the LAF standard in ISO 24612. Graph-based annotation formats lend themselves to generic, reusable query architectures, but reduce all data to the same ontological status. Specifically, corpora in corpus linguistics center on the concept of tokens, minimal technical units of linguistic analysis that serve as textual anchors for higher annotations (either features of the tokens, such as parts of speech, or higher structures, such as syntax trees). In this paper we point out a specific subset of problems caused by this dissonance between the TEI model and the token-based corpus annotation graph. We focus on the interpretation of unary XML elements, such as line or page breaks (e.g. <lb/>, <pb/>), and on the representation of the underlying data structure in non-XML-based corpus query systems. Unary elements present a particular challenge for a token-based corpus, since they occur within the plain text of a TEI document yet cover no part of the text, as shown in Figure 1.
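The mismatch described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the sample fragment, the function name `tokens_and_milestones`, and the whitespace tokenization are all assumptions made for illustration, and TEI namespaces are omitted. The sketch walks a TEI-like fragment and records each empty (unary) element as a position between tokens, which shows that such an element anchors to no token at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like fragment (not taken from the paper); namespaces omitted.
SAMPLE = "<p>the quick<lb/> brown fox<pb/> jumps</p>"


def tokens_and_milestones(xml_text):
    """Tokenize on whitespace and locate empty elements between tokens.

    Returns (tokens, milestones), where each milestone is a pair
    (tag, index) and index is the number of tokens that precede the
    empty element. The element itself covers zero tokens.
    Simplification: a break inside a word would split that word in two.
    """
    root = ET.fromstring(xml_text)
    tokens, milestones = [], []

    def walk(elem):
        # Text directly inside the element, before its first child.
        if elem.text:
            tokens.extend(elem.text.split())
        for child in elem:
            if len(child) == 0 and not child.text:
                # Unary element: no content, so it spans no tokens.
                milestones.append((child.tag, len(tokens)))
            else:
                walk(child)
            # Text following the child, still inside the parent.
            if child.tail:
                tokens.extend(child.tail.split())

    walk(root)
    return tokens, milestones


if __name__ == "__main__":
    toks, marks = tokens_and_milestones(SAMPLE)
    print(toks)
    print(marks)
```

In this sketch a line break is stored as a boundary index rather than as an annotation over a token span, which is one possible way of expressing the problem: a system that requires every annotation to cover at least one token has no obvious slot for it.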
Related resources
TEITOK: Text-Faithful Annotated Corpora
TEITOK is a web-based framework for corpus creation, annotation, and distribution, that combines textual and linguistic annotation within a single TEI based XML document. TEITOK provides several built-in NLP tools to automatically (pre)process texts, and is highly customizable. It features multiple orthographic transcription layers, and a wide range of user-defined token-based annotations. For ...
The Impact of Teaching Corpus-based Collocation on EFL Learners' Writing Ability
The present study explores the impact of corpus-based collocation instruction on intermediate Iranian EFL learners' writing ability. For this study, 84 Iranian learners, studying English as a foreign language in Bayan Institute, Iran, were selected and were randomly divided into two groups, experimental and control. Conventional methods of writing instruction were taught to the control...
Corpus Analysis based on Structural Phenomena in Texts: Exploiting TEI Encoding for Linguistic Research
This paper poses the question, how linguistic corpus-based research may be enriched by the exploitation of conceptual text structures and layout as provided via TEI annotation. Examples for possible areas of research and usage scenarios are provided based on the German historical corpus of the Deutsches Textarchiv (DTA) project, which has been consistently tagged accordant to the TEI Guidelines...
An Improved Token-Based and Starvation Free Distributed Mutual Exclusion Algorithm
Distributed mutual exclusion is a fundamental problem of distributed systems that coordinates access to critical shared resources. It concerns how the various distributed processes access the shared resources in a mutually exclusive manner. This paper presents a fully distributed, improved token-based mutual exclusion algorithm for distributed systems. In this algorithm, a process which...